K‐medoids clustering of hospital admission characteristics to classify severity of influenza virus infection

Abstract
Background: Patients are admitted to the hospital for respiratory illness at different stages of their disease course. It is important to appropriately analyse this heterogeneity in surveillance data to accurately measure disease severity among those hospitalized. The purpose of this study was to determine if unique baseline clusters of influenza patients exist and to examine the association between cluster membership and in‐hospital outcomes.
Methods: Patients hospitalized with influenza at two hospitals in Southeast Michigan during the 2017/2018 (n = 242) and 2018/2019 (n = 115) influenza seasons were included. Physiologic and laboratory variables were collected for the first 24 h of the hospital stay. K‐medoids clustering was used to determine groups of individuals based on these values. Multivariable linear regression or Firth's logistic regression was used to examine the association between cluster membership and clinical outcomes.
Results: Three clusters were selected for 2017/2018, mainly differentiated by blood glucose level. After adjustment, those in C171 had 5.6 times the odds of mechanical ventilator use compared with those in C172 (95% CI: 1.49, 21.1) and a significantly longer mean hospital length of stay than those in both C172 (mean 1.5 days longer, 95% CI: 0.2, 2.7) and C173 (mean 1.4 days longer, 95% CI: 0.3, 2.5). Similar results were seen between the two clusters selected for 2018/2019.
Conclusion: In this study of hospitalized influenza patients, we show that distinct clusters with higher disease acuity can be identified and could be targeted for evaluations of vaccine and influenza antiviral effectiveness in attenuating disease. The association of higher disease acuity with glucose level merits evaluation.


| INTRODUCTION
Infectious respiratory diseases caused by influenza virus, respiratory syncytial virus, and SARS-CoV-2 can cause significant illness and are responsible for hundreds of thousands of hospitalizations in the United States annually. 1 Data on in-hospital progression of disease and treatment course are broadly available and used to evaluate severity of illness, 2,3 or the impact of vaccination 4,5 and treatment. [6][7][8] However, the primary cause of admission, particularly in those with baseline multimorbidity, might be due to causes either exacerbated by milder respiratory tract infection (e.g., asthma) or possibly unrelated to infection (e.g., dehydration) rather than acute illness. This might bias results of vaccine or antiviral effectiveness against prevention or attenuation of severe disease. Differences in general health and health care seeking behaviour are difficult to directly measure, 9,10 and individuals may present and be admitted to the hospital at different stages in their disease course with varying disease severity. These patterns vary by population, health system, and specific aetiology. [11][12][13][14] While patients hospitalized with respiratory diseases such as influenza have historically been older with significant comorbidity, 11,15 the pattern has differed in various phases of the COVID-19 pandemic. 16 The heterogeneity of the hospitalized population at admission creates challenges when examining events occurring during hospitalization. Differential baseline comorbidity and presenting symptomology can significantly confound hospital data used as a surveillance metric for respiratory disease severity and can bias estimates of the effectiveness of interventions to reduce influenza morbidity or progression of disease.
Unsupervised machine learning algorithms provide a way to derive and characterize different groups of patients independent of an outcomes or treatment framework. 17,18 When applied to clinical data, this methodology can help identify distinct phenotypes of individuals driven by underlying relationships between health metrics.
The aims of the current study were to develop clinically distinct clusters of patients based on laboratory and physiologic measurements within the first 24 h of hospitalization, to determine whether cluster membership was associated with worse in-hospital outcomes, and to evaluate the association of influenza vaccination with in-hospital outcomes within a given cluster.
Laboratory data included non-fasting blood glucose, haematocrit, haemoglobin, blood urea nitrogen (BUN), sodium, pH, total white blood cell (WBC) count, creatinine, platelets, bilirubin, and lactic acid. 20 Estimated glomerular filtration rate (eGFR) was computed using the maximum creatinine value within the 24-h window. 21

| Clustering of data
Variables were selected for inclusion in the clustering algorithm based on their clinical relevance as indicators of illness severity. The following variables were selected, with both minimum and maximum values included unless otherwise specified: temperature, heart rate (maximum), systolic blood pressure (SBP), blood glucose, creatinine (maximum), haematocrit (minimum), sodium, WBC, platelets (minimum), respiratory rate (maximum), oxygen saturation (minimum), eGFR, and time from symptom onset to admission. Missing data for all selected physiologic measures were imputed using the study population mean stratified by age group (categories including 50-64 and 65+) and hospital. Selected metrics are listed in Table S1.
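The stratified mean imputation described above can be illustrated with a minimal Python sketch. This is not the study's code (the analysis was run in R); the function name `impute_stratified_mean`, the field names, and the example records are all hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean


def impute_stratified_mean(rows, variable, strata=("age_group", "hospital")):
    """Replace missing (None) values of `variable` with the mean of the
    observed values in the same stratum (here: age group x hospital)."""
    observed = defaultdict(list)
    for row in rows:
        if row[variable] is not None:
            key = tuple(row[s] for s in strata)
            observed[key].append(row[variable])
    stratum_means = {key: mean(vals) for key, vals in observed.items()}

    imputed = []
    for row in rows:
        new_row = dict(row)  # copy so the input records are not modified
        if new_row[variable] is None:
            key = tuple(new_row[s] for s in strata)
            # A stratum with no observed values cannot be imputed this way;
            # the value is left missing rather than guessed.
            new_row[variable] = stratum_means.get(key)
        imputed.append(new_row)
    return imputed


# Hypothetical records (illustrative values only, not study data).
patients = [
    {"age_group": "65+", "hospital": "A", "sodium_min": 134.0},
    {"age_group": "65+", "hospital": "A", "sodium_min": 138.0},
    {"age_group": "65+", "hospital": "A", "sodium_min": None},
    {"age_group": "50-64", "hospital": "B", "sodium_min": 140.0},
    {"age_group": "50-64", "hospital": "B", "sodium_min": None},
]
completed = impute_stratified_mean(patients, "sodium_min")
print([p["sodium_min"] for p in completed])
# -> [134.0, 138.0, 136.0, 140.0, 140.0]
```

Imputing within strata rather than with the overall mean keeps an imputed value closer to what is typical for patients of similar age at the same site, at the cost of relying on fewer observations per stratum.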
Prior to the creation of clusters, the Hopkins statistic was used to assess whether the data departed from a uniform random distribution. Values near 0.5 indicate that the data resemble a uniform distribution, while values closer to 1 indicate that the data may contain clusters. Use of this statistic reduces the risk of a machine learning algorithm detecting clusters where none exist. 22 Data were classified separately for each influenza season using the k-medoids partitioning around medoids (PAM) algorithm with Manhattan distance. Briefly, k-medoids clustering assigns each observation to a group based on its distance to a central data point (the medoid) of a cluster. 23 To start, the medoids are randomly assigned, and the algorithm iterates between selecting medoids and assigning observations to clusters until the total distance from each medoid to the other data points in its cluster is minimized. K-medoids clustering is more robust in the presence of outliers than other centre-based clustering algorithms such as k-means because each cluster centre is an observed data point rather than a computed mean. Additionally, this algorithm assigns every observation to a cluster; this is preferred in a cohort of hospitalized individuals, where biologically plausible outliers are of interest. The number of clusters for a given season was chosen as the one giving the largest average silhouette width, a measure of how close each observation is to its own cluster relative to the nearest other cluster, with a maximum of 10 clusters tested.
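To make the Hopkins statistic concrete, the following Python sketch computes a simplified version of it. Several choices here are assumptions rather than details from the study: the use of Manhattan distance, raw (unexponentiated) nearest-neighbour distances, a bounding box as the sampling window, and the simulated two-dimensional data. Published definitions differ in these details, so values from this sketch need not match those from an R implementation.

```python
import random


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def hopkins(data, m, seed=0):
    """Simplified Hopkins statistic, using the convention in the text:
    about 0.5 for uniformly scattered data, close to 1 for clustered data."""
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # w: distance from m sampled real points to their nearest other real point.
    sample_idx = rng.sample(range(len(data)), m)
    w = [
        min(manhattan(data[i], data[j]) for j in range(len(data)) if j != i)
        for i in sample_idx
    ]

    # u: distance from m uniform random points, drawn inside the bounding
    # box of the data, to their nearest real point.
    u = []
    for _ in range(m):
        q = [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        u.append(min(manhattan(q, p) for p in data))

    return sum(u) / (sum(u) + sum(w))


# Simulated data (illustrative only): two tight groups versus uniform scatter.
rng = random.Random(1)
clustered = (
    [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
    + [(rng.gauss(10, 0.1), rng.gauss(10, 0.1)) for _ in range(50)]
)
scattered = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(100)]
h_clustered = hopkins(clustered, m=20)
h_scattered = hopkins(scattered, m=20)
print(round(h_clustered, 2), round(h_scattered, 2))
```

When the data are clustered, real points sit very close to their neighbours (small w) while random points usually fall in empty space (large u), which pushes the ratio toward 1.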
The k-medoids clustering was performed using the "pam" function in R. Following group assignment, the silhouette width of each cluster was computed using the "silhouette" function. An average silhouette width close to 1 indicates well-separated clusters, and an average silhouette width around 0 indicates that clusters overlap or lie close together. A negative silhouette width for a given observation indicates that the data point may have been misclassified.
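The workflow described above (k-medoids with Manhattan distance, followed by selection of the number of clusters by largest average silhouette width) can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the R "pam" or "silhouette" code used in the study: it uses a greedy BUILD initialisation and accepts the first improving swap, and the function names and the toy two-dimensional points are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(dist, medoids):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Greedy BUILD, then SWAP until no single swap lowers the total cost."""
    n = len(dist)
    medoids = []
    while len(medoids) < k:  # BUILD: add the point that lowers the cost most
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(
            min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        )
    improved = True
    while improved:  # SWAP: exchange a medoid for a non-medoid if it helps
        improved = False
        best = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels


def average_silhouette(dist, labels):
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: singleton clusters get width 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


def choose_k(points, k_max=10):
    """Fit k = 2..k_max and keep the k with the largest average silhouette."""
    dist = [[manhattan(p, q) for q in points] for p in points]
    scores = {}
    for k in range(2, min(k_max, len(points) - 1) + 1):
        _, labels = pam(dist, k)
        scores[k] = average_silhouette(dist, labels)
    return max(scores, key=scores.get), scores


# Toy points (illustrative only): three visibly separated groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
    (8.0, 8.0), (8.2, 7.9), (7.9, 8.3), (8.1, 8.1),
    (1.0, 9.0), (1.3, 9.1), (0.8, 8.8), (1.1, 9.2),
]
best_k, scores = choose_k(points, k_max=5)
print(best_k)  # -> 3
```

Because every cluster centre is one of the observed points, an extreme but plausible observation can shift which point is chosen as the medoid but cannot drag the centre to a location where no patient lies, which is the robustness property noted in the text.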

| Hospitalization severity metrics
Outcomes considered for severity of illness during hospitalization included intensive care unit (ICU) admission, mechanical ventilator use, and total hospital length of stay (continuous, and prolonged, defined as ≥8 days 24 ). To determine whether different classes of early hospitalization characteristics were associated with severe hospital sequelae, a series of models was constructed separately for each influenza season. For binary outcomes (ICU admission, ventilator use, and prolonged hospital length of stay), Firth's logistic regression models were constructed. Generalized linear models were used for the continuous outcome of hospital length of stay. Variables chosen a priori for model inclusion were k-medoids cluster, age, sex, Charlson Comorbidity Index (CCI; continuous), hospital, and influenza vaccination status.
An exploratory analysis was conducted as above with outliers removed prior to clustering, with an outlier conservatively defined as a value < 1st quartile - 1.5 × (interquartile range) or > 3rd quartile + 1.5 × (interquartile range). Outliers were then replaced with the mean of the remaining values, stratified by age group and hospital. To maintain comparability with the primary analysis, the same number of clusters was used within a given influenza year.
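The outlier rule used in the exploratory analysis can be written out as a short Python sketch. The quartile method ("inclusive" interpolation in `statistics.quantiles`) is an assumption, since quartile definitions vary between software packages; the stratification by age group and hospital is omitted for brevity; and the glucose values are hypothetical.

```python
from statistics import mean, quantiles


def iqr_outlier_flags(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


def replace_outliers_with_mean(values, k=1.5):
    """Replace flagged values with the mean of the unflagged values.
    (In the study this mean was stratified; here it is a single overall mean.)"""
    flags = iqr_outlier_flags(values, k)
    kept_mean = mean(v for v, flagged in zip(values, flags) if not flagged)
    return [kept_mean if flagged else v for v, flagged in zip(values, flags)]


# Hypothetical maximum glucose values in mg/dL (illustrative only).
glucose = [90.0, 95.0, 100.0, 105.0, 110.0, 115.0, 120.0, 400.0]
flags = iqr_outlier_flags(glucose)
cleaned = replace_outliers_with_mean(glucose)
print(flags)    # only the last value is flagged
print(cleaned)  # the flagged value becomes 105.0, the mean of the other seven
```

In this example the fences are 72.5 and 142.5, so only the value of 400 is treated as an outlier. Such a value may nonetheless be clinically real, which is why the primary analysis retained outliers.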

| Statistical analysis
Analysis was conducted using RStudio version 1.
[...] (Table S2). Those in C172 had significantly lower maximum heart rate, maximum systolic blood pressure, minimum white blood cell count, minimum platelets, and estimated glomerular filtration rate than the other two clusters. Those in C172 also had significantly higher maximum creatinine than the other clusters. The rate of mechanical ventilation was higher in C171 than in C172 (22.6% vs. 6.9%), and the overall hospital length of stay was longer for those in C17 [...]
After adjustment for age group, sex, hospital, continuous CCI, and influenza vaccination status, those in C171 had 5.6 times the odds of mechanical ventilator use compared with those in C172 (95% CI: 1.49, 21.1; Figure 2). Additionally, those in C171 had a significantly longer model-adjusted mean hospital length of stay than those in both C172 (mean 1.5 days longer, 95% CI: 0.2, 2.7) and C173 (mean 1.4 days longer, 95% CI: 0.3, 2.5). There were no significant differences between clusters for the outcomes of ICU stay or prolonged hospital stay. Vaccination status was not associated with adverse outcomes in the fully adjusted models.
We found glucose to be significantly different between clusters, with one cluster having significantly higher glucose in both years. The distribution of diabetes was also consistent across years, with approximately 70% prevalence in the high-glucose clusters and 30% [...] [25][26][27]
This challenge is due in part to differential [...] For example, a 2020 study by Groeneveld et al. examining the effectiveness of oseltamivir lost 36% of oseltamivir patients and 65% of controls when matching, reducing the sample size to 88 pairs. 6 While use of propensity score matching has been shown to reduce bias, 28 such significant loss of data, especially in a rare-outcomes setting, may lead to an increase in Type II error, and thus incorrect conclusions, due to inadequate power. 29
[...] significance, as these individuals may be at higher risk for adverse outcomes. K-medoids clustering is robust to such outliers because cluster centres are observed data points rather than computed means.

| Sensitivity analysis
This study has several strengths, most notably that the cohort was nested within a large prospective two-centre study of influenza vaccine effectiveness across multiple seasons, allowing for a robust and diverse analytic cohort. Both case definition and EHR data capture were standardized across sites, reducing heterogeneity of data quality. Additionally, the use of two hospitals within our region allowed for a more generalizable analysis. The main limitations of the study are the small sample size and small number of outcomes; in addition, because of missing data, we were unable to adjust for vaccination type in modelling. However, we believe our analysis has minimized some of the bias from these limitations. Finally, while the k-medoids clusters presented here may not be generalizable to other cohorts, the methodology has many direct and current applications in severity analysis.
One of the most immediate applications is evaluating the effectiveness of new and existing antivirals for severe respiratory disease. Previous studies of such treatments have utilized traditional methods of covariate adjustment, which may contribute to heterogeneity of study findings. 31 The use of this clustering method to phenotype baseline presentation can reduce this confounding and can be quickly implemented for these analyses. Such a technique will be needed as we continue to understand how new antiviral treatments affect severity, and how vaccination impacts severity in instances of low vaccine effectiveness.

| CONCLUSIONS
In conclusion, we found it was possible to cluster adult patients hospitalized with influenza into clinically distinct groups by